Abstract

A  probability  sequence  is  an  ordered  set  of  probability  forecasts  for  the  same 
event.  Although  single-period  probabilistic  forecasts  and  methods  for  evaluating 
them  have  been  extensively  analyzed,  we  are  not  aware  of  any  prior  work  on 
evaluating  probability  sequences.  This  paper  proposes  an  efficiency  condition 
for  probability  sequences  and  shows  properties  of  efficient  forecasting  systems, 
including  memorylessness  and  increasing  discrimination.  These  results  suggest 
tests for efficiency and remedial interventions for inefficient systems.

1  Background 

A probability forecast is an estimate of the probability that a precisely defined
event will occur. A probability sequence is an ordered set of probability forecasts
for a single verifying event. An example currently appearing in the press is the
forecast for the 2014 US Senate election at fivethirtyeight.com,^1 which in March
of 2014 forecast a Republican Senate majority in the 114th Congress with a
probability of 0.508. The most recent earlier forecast, issued in July of 2013,
was for a Republican majority with a probability of 0.504. The forecast will
be updated more frequently as the election approaches, forming a probability
sequence for the event that the Republicans win the Senate majority in the
114th Congress. A similar algorithm was used to forecast the 2012 US
national elections, and issued probability forecasts at least weekly starting in
July 2012 until the eve of the election.^2 For example, six of the forecasts for a
Republican Senate majority in the 113th Congress are given in Table 1. Wang
and Campbell (2013) describe the fivethirtyeight forecasting system and other
statistical modeling for election outcomes including their own.

^1 http://fivethirtyeight.com/features/fivethirtyeight-senate-forecast/, accessed May 2014.

Date forecast issued   July 1   July 29   Sept. 2   Sept. 30   Oct. 31   Nov. 6   Outcome
Probability            0.425    0.550     0.475     0.198      0.081     0.047    0

Table 1: Probability forecasts issued in 2012 by the New York Times fivethirtyeight
blog for the Republican party to win a Senate majority in the 113th US
Congress.

Another  familiar  example  of  a  probability  sequence  is  the  National  Hurricane 
Center’s  (NHC’s)  probability  forecast  for  winds  exceeding  a  given  threshold, 
generated  every  six  hours  over  the  course  of  the  storm  (DeMaria  et  al.  2009). 
The  wind-speed  probability  product  estimates  the  probability  of  one-minute 
sustained  winds  exceeding  each  of  three  thresholds  within  six-hour  (or  longer) 
periods  for  each  half-degree  cell  in  the  region.  For  example,  the  NHC  forecasts 
for  tropical-storm  force  winds  (winds  exceeding  34  knots)  for  2012’s  Hurricane 
Sandy  affecting  New  York  City  for  two  24-hour  periods  are  given  in  Table  2. 
Two  sequences  are  shown,  for  two  distinct  (though  not  independent)  events. 

                       date forecast issued
Event date   Oct. 25   Oct. 26   Oct. 27   Oct. 28   Oct. 29   Oct. 30
Oct. 28      0.06      0.06      0.11      0.04      0*
Oct. 29      0.14      0.31      0.42      0.80      0.92      1*

Table 2: NHC’s probability sequence issued at 11 am EDT each day for winds
exceeding 34 knots in NYC between 8 am EDT on the event date indicated and
8 am the following day. The final entry in each row, marked with an asterisk,
is the outcome.

In contrast to the outcome of an election, whose resolution is clearly tied
to a specific time (election day), probability forecasting for time-series variables
highlights the importance of the distinction between a rolling-horizon forecast
and a fixed-event forecast. Every 24-hour period has a maximum sustained
wind in NYC. The event that the maximum sustained wind exceeds 34 knots
between 8 am on October 28 and 8 am on October 29 is a distinct event from
the event that the maximum sustained wind exceeds 34 knots between 8 am
on October 29 and 8 am on October 30. In the context of forecast sequences,
a sequence is a set of forecasts for an event whose definition does not change.
We are interested in the relationships among multiple forecasts for the fixed
event, not in relationships among, for example, the 48-hour lead forecasts for
maximum sustained winds on different dates. Table 3 shows the rolling-horizon
forecast for maximum sustained wind speeds exceeding 34 knots at NYC during
the period approximately 45 to 69 hours after the 11 am issuance of the forecast.
Note that the forecasts issued October 26 and 27 correspond to the fixed-event
forecasts for the event dates October 28 and 29, respectively, shown in Table 2.

Date forecast issued   Oct. 24   Oct. 25   Oct. 26   Oct. 27   Oct. 28   Oct. 29   Oct. 30
Probability            <0.01     <0.01     0.06      0.42      <0.01     <0.01     <0.01

Table 3: NHC’s probability forecasts issued at 11 am EDT for winds exceeding
34 knots in NYC between 8 am EDT two days later and 8 am three days later.
This is an example of a rolling-horizon forecast and is not a probability sequence
as defined in this paper.

^2 http://fivethirtyeight.blogs.nytimes.com/fivethirtyeights-2012-forecast/, accessed May 2014.

A  forecasting  system  is  a  process  for  generating  a  probability  forecast  for 
many  similar  events,  such  as  many  elections  or  winds  experienced  in  many 
storms,  or  many  locations  or  verifying  time  periods.  The  forecasting  process 
may be subjective, statistical, dynamical, or some combination of the above.
One example of subjective forecasts is the set of individual probabilistic forecasts
for ranges of change in economic variables such as national gross domestic product
and price levels collected in the periodic Surveys of Professional Forecasters,
conducted in the US by the Federal Reserve Bank and in Europe by the European
Central Bank. Another example is the set of individual probabilistic forecasts
for various world events elicited as part of the Intelligence Advanced Research
Projects Activity (IARPA)’s Aggregative Contingent Estimation Program,^3 one
instance of which is described in Mellers et al. (2014).

Purely statistical forecasting systems, such as the fivethirtyeight election
forecasting system, use historical and event-specific data to estimate the
probability of events. Dynamical forecasting systems, such as numerical weather
prediction models, may be used in simulation mode, as in ensemble-based
forecasting systems, to produce probability forecasts (Sivillo, Ahlquist, and Toth
1997). Prediction market prices may also be interpreted as probabilistic
forecasts reflecting a consensus of market participants, although they are not
necessarily calibrated to the market participants’ mean subjective probability
(Manski 2006).

Many forecasting systems are combinations of subjective and statistical
forecasts, for example averaging or otherwise post-processing subjective forecasts
from many individuals, as in Baron, Mellers, Tetlock, Stone, and Ungar (2014),
to produce a distinct forecasting system. The system consists of the method for
eliciting individual subjective forecasts together with the aggregation process.
Most meteorological ensemble-forecasting systems also include post-processing
to produce a probabilistic forecast, for example adjusting for underdispersion in
the ensemble.

We consider the evaluation of a forecasting system, theoretically capable of
issuing forecasts for an infinite set of events. For a binary event, a single-period
probabilistic forecasting system may be represented as a joint probability
distribution $f_{P,X}(p, x)$ of two random variables, a forecast $P$ and an associated
outcome $X$, with realizations $p \in [0, 1]$ and $x \in \{0, 1\}$. There is a large
literature addressing the question of how to evaluate a single-period probabilistic
forecasting system, or a sample from such a system. One criterion that is widely,
perhaps universally, advocated is reliability. A forecasting system is perfectly
reliable if $P(X = 1 \mid p) = p$ for all $p$. The degree of reliability can be measured in
a number of different ways, and usually it is not measured separately from the
second primary criterion, discrimination.

^3 http://www.iarpa.gov/index.php/research-programs/ace, accessed June 2014.

Discrimination, also called sharpness or resolution, is the ability of the
forecasting system to differentiate among events by assigning high and low
probabilities, relative to the base rate, $E[X]$. Discrimination is only valuable if
paired with a high degree of reliability. At the extreme, a forecasting system
that randomly issues forecasts of zero and one has high discrimination but the
forecasts would be uncorrelated with the outcomes. The reverse is also true: a
forecasting system that always forecasts $p = E[X]$ would be perfectly reliable
but uninformative to anyone who knew the base rate.

There are many ways to measure imperfect reliability and discrimination.
Many scoring functions, often called scoring rules, have been proposed for
evaluating the overall performance of a forecasting system (or a sample). They
combine the measures of reliability and discrimination into a scalar value, and
take the form $s(p, x)$. When $x$ has a finite number of values, the function can
be separated, and for binary $X$, $s(p, x) = x\, s_1(p) + (1 - x)\, s_0(p)$, where
$s_1(p) = s(p, 1)$ and $s_0(p) = s(p, 0)$. A scoring function is positively (negatively)
oriented if higher (lower) scores reflect better performance, and therefore
$s_1'(p) > (<)\, 0$ and $s_0'(p) < (>)\, 0$.

A large literature discusses the advantages and disadvantages of various
scoring functions. Some of this work addresses the utility of the forecasting system
to a user in a decision context (Murphy and Ehrendorfer 1987; Granger and
Pesaran 2000; Jose, Nau, and Winkler 2008). Other work addresses desirable
properties of a forecasting system based on axiomatic appeals (Selten 1998;
Brocker and Smith 2007; Winkler 1994; Jose, Nau, and Winkler 2009; Gneiting
and Raftery 2007).

The best choice of a scoring function depends on the purpose of the score
and the forecast. If the scoring function is used to evaluate human forecasters
(Johnstone, Jose, and Winkler 2011) or alternative models whose designers are
rewarded according to the function (Gneiting and Raftery 2007), and therefore
creates an incentive scheme, then an important criterion is that the scoring
rule is (strictly) proper, meaning that the expected score is (uniquely) maximized
by a probabilistic forecast equal to the forecaster’s subjective probability, $r$, of
the event. Mathematically, $s(p, x)$ is strictly proper if it satisfies (1). Even
properness is not a universally accepted criterion for selecting a scoring function.
Bickel (2007) points out that if the forecaster’s utility function is not linear in the
scoring function, a proper scoring function may not create the desired incentives.
The meteorology community commonly uses improper skill scores in evaluating
its forecasts, to capture the improvement in forecast relative to a baseline, as in
(?), which uses the Brier skill score to evaluate the NHC’s wind-speed probability
forecasting system. Research in evaluating single-period probability forecasts is
active in the meteorology, decision science, and economics forecasting literature.

definition of strictly proper $s(\cdot)$: $E_r[s(r, X)] > E_r[s(p, X)] \quad \forall p \neq r,\ p, r \in [0, 1]$ (1)
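Condition (1) can be checked numerically for particular scoring functions. The sketch below is an illustration only: the function names, the grid of candidate forecasts, and the choice of a linear score $1 - |p - x|$ as an improper example are assumptions made here, not part of the original analysis. It searches for the forecast that maximizes the expected score when the forecaster’s subjective probability is $r$.

```python
import math

def neg_brier(p, x):
    """Negative Brier score, positively oriented."""
    return -(p - x) ** 2

def log_score(p, x):
    """Log of the probability assigned to the realized outcome."""
    return math.log(p) if x == 1 else math.log(1.0 - p)

def linear_score(p, x):
    """Linear score 1 - |p - x|; used here as an example of an improper score."""
    return 1.0 - abs(p - x)

def expected_score(score, p, r):
    """E_r[s(p, X)] when X is Bernoulli with P(X = 1) = r."""
    return r * score(p, 1) + (1.0 - r) * score(p, 0)

def best_forecast(score, r, grid):
    """Forecast on the grid that maximizes the expected score under belief r."""
    return max(grid, key=lambda p: expected_score(score, p, r))

# Candidate forecasts 0.01, 0.02, ..., 0.99 (an arbitrary illustrative grid).
GRID = [i / 100 for i in range(1, 100)]

for name, score in [("negative Brier", neg_brier),
                    ("log", log_score),
                    ("linear", linear_score)]:
    print(name, best_forecast(score, 0.3, GRID))
```

Under these assumptions, the two strictly proper scores are maximized by reporting $p = r$, while the linear score rewards pushing the forecast toward the nearer extreme, which is why it does not elicit the forecaster’s subjective probability.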

A few researchers have empirically compared the performance of the same
forecasting system at different leads. For example, Mellers et al. (2014) allowed
subjective forecasters to revise their probability forecasts over time before an
event was resolved, and found that subjective forecasts elicited in their
experiments improve on average between the first and last week that the forecasters
were able to forecast. Clements (2004) compared U.K. Monetary Policy
Committee’s forecasts of the probability of inflation exceeding 2.5% in the current
quarter, the next quarter, and one year ahead, and found improvements in Brier
score and log probability score as the event drew closer. We have not found any
empirical examination of the relationships among the forecasts in a sequence,
such as a search for inter-period correlation or trends, or any prescriptive
analysis of probability sequences.

Nor  are  we  aware  of  any  research  addressing  appropriate  scoring  functions 
for  probability  sequences.  It  is  interesting  to  note  that  Selten  (1998)  clearly 
thought  of  each  single-period  forecast  as  part  of  a  sequence,  but  nevertheless 
limited his definition of a scoring rule to “measur[ing] the predictive success of
a period for every period separately” (p. 44).

In  the  next  section,  we  introduce  notation  and  definitions  for  sequences. 
In  Section  3,  we  propose  an  efficiency  criterion  for  probability  sequences,  and 
use it to derive properties of an efficient probabilistic forecasting system,
discussing some of their implications. Finally, we conclude with a discussion of
the  potential  use  of  these  results  for  diagnosing  and  remediating  inefficiency  in 
sequence-forecasting  systems,  both  subjective  and  model-based. 

2  Sequences 

For a probability sequence, a forecasting system is a joint probability
distribution over a finite ordered sequence of $T$ probabilistic forecasts, indexed by $t$.
Forecasts $p_t \in [0, 1]$ are realizations of the corresponding random variables $P_t$.
For convenience, we will also write the vector $\mathbf{P}_t = (P_T, P_{T-1}, \ldots, P_t)$ and its
realization $\mathbf{p}_t$.
Treating larger $t$ as indicating a greater chronological distance from the
outcome, $t$ declines over a sequence, $t = T, \ldots, 1$, with forecast $T$ being the first
and $t = 0$ corresponding to the actual outcome of the event being forecast. The
ordering of the sequence can also be interpreted as conditioning on successive
subsets of the sample space, with no necessary relationship with timing.
However, we rely on the assumption that any information available to the generation
of forecast $\tau$ is also available for $t < \tau$.

We disallow perfect correlation between any pair $P_t$ and $P_\tau$, $t \neq \tau$, because
perfect correlation effectively reduces the system to at most a $(T - 1)$-period
forecasting system. This reasoning will be discussed in more detail in Section
3.1.

The forecasting system may be denoted $f_{P_T, P_{T-1}, \ldots, P_1, X}(\cdot) = f_{\mathbf{P}_1, X}(\cdot)$, and
may be used to describe functions of $x$ and $\mathbf{p}_t$, such as $P(X = 1 \mid \mathbf{p}_t)$ and
$E[s(P_t, X)]$.

3  An  efficiency  condition  for  probability  sequences 

Criteria  that  apply  to  single-period  probability  forecasts,  such  as  reliability, 
should  also  apply  to  probability  sequences.  However,  sequences  should  satisfy 
additional  criteria.  For  example,  forecasts  should  improve  over  time  with  respect 
to  a  single-period  scoring  function,  as  information  available  to  support  forecasts 
in  early  periods  is  also  available  in  later  periods.  An  important  question  is  how 
to  determine  whether  they  are  improving  enough. 

As  a  basis  for  developing  specific  performance  criteria  for  sequences,  we  use 
a  minimal  condition  for  a  good  forecasting  system:  it  cannot  be  bootstrapped. 

For  a  bootstrap-proof  sequence,  in  each  period  t,  it  is  impossible  to  write  a 
function  with  the  earlier  forecasts  in  the  sequence  itself  as  the  only  arguments 
that will score better in expectation than the last forecast in the sequence.
This is analogous to the criterion of weak efficiency for fixed-event forecasts
as defined by Nordhaus (1987) in the forecasting of non-probabilistic economic
time series. Specifically, a forecast is weakly efficient if the forecast minimizes
the expected error conditional on forecasts issued previously.

We  adopt  the  term  efficient  to  describe  a  bootstrap-proof  system,  and  in 
the  context  of  probability  sequences,  we  operationalize  this  condition  with  the 
following  axiom: 

efficiency  axiom 

Given a strictly-proper positively-oriented single-period scoring function $s(p, x)$,
a sequence forecasting system is efficient if $\forall t, \mathbf{p}_t$, $\nexists\, g(\mathbf{p}_t)$ s.t.

$E[s(g(\mathbf{p}_t), X) \mid \mathbf{p}_t] > E[s(p_t, X) \mid \mathbf{p}_t]$ (2)

The efficiency axiom says that, given complete knowledge of the forecasting
system, its performance cannot be improved by adjusting its forecasts based
only on the earlier forecasts in the sequence itself. This is true not just in
expectation over all possible values of $\mathbf{P}_t$, but conditional on any particular
sequence $\mathbf{p}_t$.

Note that if $s(p, x) = -(p - x)^2$, a positively-oriented single-period function
equal to the negative Brier score (Brier 1950), then our efficiency condition
is exactly equivalent to Nordhaus (1987)’s condition of weak efficiency for the
forecast of a binary variable. Nordhaus (1987)’s standard of efficiency is that
the forecast minimizes the squared error, whereas the efficiency condition above
is defined with respect to any strictly proper scoring function.

Next, we derive properties of efficient sequence-forecasting systems.

3.1 Reliable and memoryless

We call a sequence-forecasting system each-period reliable if it satisfies (3)
and memoryless if it satisfies (4).

each-period reliability: $P(X = 1 \mid p_t) = p_t \quad \forall t, p_t$ (3)

memorylessness: $P(X = 1 \mid \mathbf{p}_t) = P(X = 1 \mid p_t) \quad \forall t, \mathbf{p}_t$. (4)

Proposition 1 An efficient sequence-forecasting system is each-period reliable
and memoryless.

Proof By definition of a strictly proper scoring function, if (3) is violated, then
$g(\mathbf{p}_t) = P(X = 1 \mid \mathbf{p}_t)$ satisfies (2) and the system is inefficient.

Since (3) has been shown for an efficient system, the conditioning on $\mathbf{p}_{t+1}$
may be suppressed as in (4), and an efficient system is memoryless. ■

Each-period reliability is appealing as a performance criterion for sequences
because reliability is a standard of optimality for single-period probability
forecasts. An unreliable forecasting system is called poorly calibrated, and if the
properties of the system are known, it can be calibrated in post-processing: a
new forecast $p^* = g(p) = P(X = 1 \mid p)$ can be issued that is reliable.

Memorylessness is perhaps less intuitive. An efficient system is fully
updated at each period in that prior forecasts contain no additional information
independent of the most recent forecast $p_t$. If this were not true, the forecasting
system could be bootstrapped.

Some readers may find the implication of (4) that $P(X = 1 \mid p_T = 0.9) =
P(X = 1 \mid p_1 = 0.9)$ counterintuitive because a forecast of $p = 0.9$ (if far from the
base rate) is highly informative, and if $T > 1$, more information is anticipated
to become available between time $T$ and time 1. We would expect, however,
that an extreme forecast (far from the base rate) is much less common at time
$T$ than at time 1. If it occurs, it is reliable, but it is unlikely to occur, due to
lower discrimination of early forecasts, as discussed further in Section 3.4.

We specified in the definition of a forecasting system that no two periods’
forecasts may be perfectly correlated. Perfect correlation between any $P_t$ and
$P_\tau$ for $t \neq \tau$, together with a reliability constraint (3) on each period, implies
that $P_t = P_\tau$ always, and therefore the forecasts are identical, and the system is
at most a length-$(T - 1)$ sequence of distinct probability forecasts.

3.2  Inter-period  reliable 

A further implication of each-period reliability is inter-period reliability. We
call a forecasting system inter-period reliable if it satisfies (5).

inter-period reliability: $E[P_t \mid p_\tau] = p_\tau \quad \forall t < \tau$. (5)

Proposition 2 An efficient forecasting system is inter-period reliable.

Proof

$p_\tau = P(X = 1 \mid p_\tau)$ (6a)

$= \int_0^1 P(X = 1 \mid p_\tau, p_t)\, f_{P_t \mid P_\tau}(p_t \mid p_\tau)\, dp_t$ (6b)

$= \int_0^1 p_t\, f_{P_t \mid P_\tau}(p_t \mid p_\tau)\, dp_t = E[P_t \mid p_\tau]$. (6c)

where (6a) uses each-period reliability, (6b) uses the law of total probability,
and (6c) uses memorylessness. ■

Inter-period reliability is also an intuitive property of an efficient system.
If a system is not inter-period reliable, then a forecast $g(p_\tau) = E[P_t \mid p_\tau]$ will
satisfy (2).

Both inter-period reliability and each-period reliability hold for a forecasting
system that is efficient with respect to any strictly proper scoring function. They
do not depend on the scoring function.

3.3  Unpredictable  revisions 

A direct consequence of inter-period reliability is unpredictability in revisions.
Equation (7) gives specific statements of unpredictability in the period-$t$ revision
$P_{t-1} - p_t$.

$\forall \mathbf{p}_t,\ t > 1$:

zero expected revisions: $E[P_{t-1} - p_t \mid \mathbf{p}_t] = 0$ (7a)

zero expected autocorrelation: $E[(P_{t-1} - p_t)(p_t - p_{t+1}) \mid \mathbf{p}_t] = 0,\ t < T$ (7b)

Proposition 3 For an efficient sequence-forecasting system, expected revisions
and autocorrelation in revisions are always zero.

Although these properties follow directly from inter-period reliability, some
readers may find them counter-intuitive, so we offer a formal proof.

Proof

$E[P_{t-1} - p_t \mid \mathbf{p}_t] = E[P_{t-1} \mid \mathbf{p}_t] - p_t = 0 \Rightarrow$ (7a)

$E[(P_{t-1} - p_t)(p_t - p_{t+1}) \mid \mathbf{p}_t] = E[(P_{t-1} - p_t) \mid \mathbf{p}_t]\,(p_t - p_{t+1}) = 0 \Rightarrow$ (7b)

The properties defined in (7) also imply the weaker but more familiar
properties:

$E[P_{t-1} - P_t] = 0$ and
$E[(P_{t-1} - P_t)(P_t - P_{t+1})] = 0$

Neither trends (positive autocorrelation) nor noise (negative autocorrelation)
is expected in forecasts from an efficient system. In an efficient system, there is
no predictability in revisions, conditional on prior forecasts. If there were, it
could be exploited to improve the forecast.
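A small simulation may make the absence of predictable revisions concrete. The sketch below is an illustration under assumptions introduced here, not a system from the paper: the event is that the sum of four standard normal shocks is positive, and the forecast issued after each of the first three shocks is the exact conditional probability of the event, so the sequence is efficient by construction. For contrast, a hypothetical smoothed forecaster moves only halfway toward each new efficient forecast.

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_forecasts(n_events, seed=0, n_shocks=4):
    """Simulate efficient forecast sequences (p_3, p_2, p_1), earliest first.

    Hypothetical event: X = 1 if the sum of n_shocks standard normal shocks
    is positive. The forecast after k shocks is the exact conditional
    probability of X = 1 given the shocks observed so far.
    """
    rng = random.Random(seed)
    sequences = []
    for _ in range(n_events):
        partial_sum, forecasts = 0.0, []
        for k in range(1, n_shocks):
            partial_sum += rng.gauss(0.0, 1.0)
            forecasts.append(normal_cdf(partial_sum / math.sqrt(n_shocks - k)))
        sequences.append(tuple(forecasts))
    return sequences

def smooth(sequence, weight=0.5):
    """Inefficient variant: each forecast moves only partway toward the news."""
    smoothed = [sequence[0]]
    for p in sequence[1:]:
        smoothed.append(weight * p + (1.0 - weight) * smoothed[-1])
    return tuple(smoothed)

def revision_stats(sequences):
    """Mean first revision, mean second revision, and their correlation."""
    first = [s[1] - s[0] for s in sequences]
    second = [s[2] - s[1] for s in sequences]
    n = len(sequences)
    m1, m2 = sum(first) / n, sum(second) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(first, second))
    var1 = sum((a - m1) ** 2 for a in first)
    var2 = sum((b - m2) ** 2 for b in second)
    return m1, m2, cov / math.sqrt(var1 * var2)

efficient = simulate_forecasts(20000)
print("efficient:", revision_stats(efficient))
print("smoothed: ", revision_stats([smooth(s) for s in efficient]))
```

In this setup the efficient sequences should show mean revisions and revision autocorrelation close to zero, up to sampling error, while the smoothed sequences show clearly positive autocorrelation, because part of each piece of news is deferred to the next revision.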

This result may be counterintuitive to some readers. Nordhaus (1987) offers
an intuition for this result:

efficient forecasts appear jagged because they incorporate all news
quickly. Inefficient forecasts appear smoother and more consistent,
for they let the news seep in slowly. (p. 669)

The  fact  that  these  properties  are  not  universally  intuitive  indicates  that 
subjective  probabilistic  forecasts  may  violate  this  criterion  and  therefore  be 
susceptible  to  improvement  as  discussed  further  in  Section  4. 

3.4  Strictly  Improving 

A forecasting system is strictly improving with respect to a positively-oriented
scoring function $s(\cdot)$ if it satisfies (9).

strict improvement: $t < \tau \Rightarrow E[s(P_t, X) \mid p_\tau] > E[s(p_\tau, X)], \quad \forall p_\tau,\ \tau > 1$ (9)

Proposition 4 An efficient sequence-forecasting system is strictly improving
with respect to any strictly proper scoring function.

Proof The proof depends on the strict convexity of $E[s(p, X) \mid p]$ for proper
$s(\cdot)$ and reliable $P$, which is shown in Appendix 1.1. For an efficient forecasting
system, $P_\tau$ is reliable, and therefore for proper $s(\cdot)$, by Jensen’s inequality,

$E[s(P_t, X) \mid p_\tau] > E[s(E[P_t \mid p_\tau], X)] = E[s(p_\tau, X)] \Rightarrow$ (9)

Note that (9) is a stricter condition than $t < \tau \Rightarrow E[s(P_t, X)] > E[s(P_\tau, X)]$,
which is also true for an efficient forecasting system. By (9), the expected score
for later periods improves conditional on every value of $P_\tau$ (and therefore
conditional on every $\mathbf{P}_\tau$).

The strictly-improving property of efficient forecast systems means that a
system that is efficient with respect to any strictly proper scoring function is
strictly improving with respect to all strictly proper scoring functions.
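The Jensen step in the proof of Proposition 4 can be spelled out as a chain of equalities. The following is a sketch rather than the authors’ own derivation: the symbol $h$ is a label introduced here for the expected score of a reliable forecast, and the strict inequality assumes that $P_t$ is not degenerate given $p_\tau$.

```latex
% h(p) is notation introduced for this sketch (it does not appear in the
% original text): the expected score of a reliable forecast p.
h(p) \equiv E[s(p, X) \mid p] = p\, s(p, 1) + (1 - p)\, s(p, 0)

\begin{aligned}
E[s(P_t, X) \mid p_\tau]
  &= E\big[\, E[s(P_t, X) \mid P_t, p_\tau] \,\big|\, p_\tau \big]
     && \text{(law of iterated expectations)} \\
  &= E[h(P_t) \mid p_\tau]
     && \text{(each-period reliability and memorylessness)} \\
  &> h\big(E[P_t \mid p_\tau]\big)
     && \text{(Jensen's inequality, $h$ strictly convex)} \\
  &= h(p_\tau)
     && \text{(inter-period reliability (5))}
\end{aligned}
```

Read this way, the improvement in (9) comes entirely from the spread of $P_t$ around $p_\tau$: the later forecast has the same conditional mean as the earlier one but is more dispersed.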

Since  each  period’s  forecasts  are  perfectly  reliable,  the  improvement  must 
come from better discrimination. Different scoring functions measure
discrimination differently. Conditional on reliability, the expected negative Brier score
(which  is  positively  oriented)  is  (10a)  while  the  expected  log  score  is  (10b).  As 
discussed  in  Section  4.2,  the  Brier  score  discrimination  component  is  equal  to 
the  sample  standard  deviation,  a  common  measure  of  dispersion.  For  an  efficient 
forecast  all  of  these  measures  of  dispersion  will  be  increasing  in  expectation  over 
the  forecast  sequence.  Although  large  dispersion  is  usually  associated  with  a  less 
informative  forecast,  in  this  context,  dispersion  increases  with  the  probability 
of  extreme  (far  from  the  base  rate)  forecasts  that  are  also  reliable,  and  therefore 
informative. 


Brier score discrimination: $-\int_0^1 f(p)\, p\,(1 - p)\, dp$ (10a)

log score discrimination: $\int_0^1 f(p)\, \big(p \ln(p) + (1 - p) \ln(1 - p)\big)\, dp$ (10b)
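The integrands in (10a) and (10b) are the expected scores of a single reliable forecast $p$. A brief worked derivation follows, as a sketch under two assumptions stated here: that $f(p)$ denotes the marginal density of the forecast, and that the log score is the logarithm of the probability assigned to the realized outcome.

```latex
% Expected score of a reliable forecast p, i.e. when P(X = 1 | p) = p.
\begin{aligned}
E[-(p - X)^2 \mid p]
  &= -\big[\, p\,(1 - p)^2 + (1 - p)\,p^2 \,\big] = -p\,(1 - p), \\
E[X \ln p + (1 - X) \ln(1 - p) \mid p]
  &= p \ln p + (1 - p) \ln(1 - p).
\end{aligned}
```

Averaging each expression over the distribution of forecasts gives (10a) and (10b). Both integrands are smallest at $p = 0.5$ and largest near 0 and 1, so a reliable system scores better the more probability it places on extreme forecasts.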
(a) Forecasting system 1.

              P1 = 0.8         P1 = 0.2
P2 = 0.6      0.160 / 0.040    0.020 / 0.080
P2 = 0.4      0.187 / 0.047    0.093 / 0.373

(b) Forecasting system 2.

              P1 = 0.8         P1 = 0.2
P2 = 0.6      0.120 / 0.000    0.060 / 0.120
P2 = 0.4      0.227 / 0.087    0.053 / 0.333

Table 4: Two forecasting systems. The values shown are the joint probabilities
$P(P_2 = p_2, P_1 = p_1, X = 1)$ / $P(P_2 = p_2, P_1 = p_1, X = 0)$. Recall that forecast
$P_1$ is issued after forecast $P_2$.


4  Discussion 


4.1  Diagnosing  inefficiency:  an  example 

Each of the properties of efficient sequence-forecasting systems derived in
Section 3 is testable and may be used to diagnose inefficiency. A simple example
illustrates this for a forecasting system whose complete joint probability
distribution is known. Table 4 shows two two-period forecasting systems for a binary
event, each with a base rate $E[X] = 0.460$. Table 4 gives the distribution of
the forecasts and outcome.

Both systems are each-period reliable, i.e. $P(X = 1 \mid p_t) = p_t$ for each value
of $t$ and of $p_t$. For example, for System 1, shown in Table 4a,

$P(X = 1 \mid p_2 = 0.6) = \frac{0.160 + 0.020}{0.160 + 0.040 + 0.020 + 0.080} = 0.6$

Both systems are also improving, with negative Brier scores $-0.240$ for $P_2$
and $-0.160$ for $P_1$ (the negative Brier score is positively oriented), so the later
forecast, $P_1$, is better than the earlier forecast as measured by the Brier score.
Both systems are improving with respect to any single-period score, since the
two systems have identical marginal probabilities $P(P_t = p_t)$ and conditional
probabilities $P(X = x \mid P_t = p_t)$ $\forall t, p_t, x$, and therefore identical
single-period expected scores $E[s(P_t, X)]$ regardless of the choice of $s(\cdot)$.

However, System 1 is efficient while System 2 is not. This can be diagnosed
by testing for inter-period reliability. For System 1, $E[P_1 \mid p_2 = 0.6] = 0.6$
and $E[P_1 \mid p_2 = 0.4] = 0.4$, while for System 2, $E[P_1 \mid p_2 = 0.6] = 0.440$ and
$E[P_1 \mid p_2 = 0.4] = 0.469$. This means System 2 is bootstrappable as illustrated
in Section 4.3.
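The diagnostic calculations for Table 4 can be reproduced with a short script. The sketch below is illustrative rather than part of the original analysis: the function names are hypothetical, the table entries are rounded to three decimals so agreement with the text is approximate, and the last function uses one possible bootstrap, $g(p_2, p_1) = P(X = 1 \mid p_2, p_1)$, chosen here for illustration.

```python
# Joint probabilities from Table 4, keyed by (p2, p1).
# Each value is (P(p2, p1, X = 1), P(p2, p1, X = 0)).
SYSTEM_1 = {(0.6, 0.8): (0.160, 0.040), (0.6, 0.2): (0.020, 0.080),
            (0.4, 0.8): (0.187, 0.047), (0.4, 0.2): (0.093, 0.373)}
SYSTEM_2 = {(0.6, 0.8): (0.120, 0.000), (0.6, 0.2): (0.060, 0.120),
            (0.4, 0.8): (0.227, 0.087), (0.4, 0.2): (0.053, 0.333)}

def prob_event_given(system, period, value):
    """P(X = 1 | P_period = value), the each-period reliability check (3)."""
    idx = 0 if period == 2 else 1
    cells = [v for k, v in system.items() if k[idx] == value]
    return sum(a for a, _ in cells) / sum(a + b for a, b in cells)

def expected_later_given_earlier(system, p2):
    """E[P_1 | P_2 = p2], the inter-period reliability check (5)."""
    cells = {k: v for k, v in system.items() if k[0] == p2}
    total = sum(a + b for a, b in cells.values())
    return sum(k[1] * (a + b) for k, (a, b) in cells.items()) / total

def neg_brier(system, period):
    """Expected negative Brier score of the forecast issued in the period."""
    idx = 0 if period == 2 else 1
    return -sum(a * (k[idx] - 1) ** 2 + b * k[idx] ** 2
                for k, (a, b) in system.items())

def neg_brier_bootstrapped(system):
    """Expected negative Brier score of g(p2, p1) = P(X = 1 | p2, p1)."""
    total = 0.0
    for a, b in system.values():
        g = a / (a + b)
        total -= a * (g - 1) ** 2 + b * g ** 2
    return total

for name, system in [("System 1", SYSTEM_1), ("System 2", SYSTEM_2)]:
    print(name,
          round(expected_later_given_earlier(system, 0.6), 3),
          round(expected_later_given_earlier(system, 0.4), 3),
          round(neg_brier(system, 1), 3),
          round(neg_brier_bootstrapped(system), 3))
```

Under these assumptions, the bootstrapped forecast leaves the expected score of System 1 essentially unchanged, while for System 2 it improves on $P_1$, which is what the failure of inter-period reliability predicts.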

4.2  Diagnosing  inefficiency  based  on  a  sample 

The properties of an efficient forecasting system suggest statistical tests for
efficiency in the more common situation in which a sample from the forecasting
system is available, but not the complete distribution. One of the challenges
in evaluating single-period probabilistic forecasts is the data requirement, a
challenge that is compounded for probability sequences. A sample from a
sequence-forecasting system consists of $N$ observations, where the $n$th
observation is $(\mathbf{p}_{1,n}, x_n)$, consisting of $T$ forecasts plus the outcome $x_n$, for a
total of $N \times (T + 1)$ observed values.

However, a sample from a sequence forecasting system may be tested for
violations of the efficiency properties based on one or two periods’ forecasts.
The inter-period reliability property and its corollaries, zero expected revisions
and zero autocorrelation, are especially interesting because they depend only
on the inter-period covariance structure of the forecasts and do not depend
on the relationship with the outcome $X$. Moreover, a forecasting system in
which realizations of the outcome are costly, delayed, or otherwise difficult to
assess, may be evaluated with respect to violations of these properties without
data about the actual outcomes. In the context of NHC wind-speed probability
forecasts, for example, it is sometimes impossible to be sure whether winds
exceeded a given threshold because the maximum sustained wind might not
have occurred at a functional measuring station.

For example, a null hypothesis of zero autocorrelation in revisions may be
tested (for a given t) simply by regressing the t-period revisions against the
(t - 1)-period revisions and using a Student's^ t-test for a linear
relationship, interpreting the p-value as the probability, if the null
hypothesis (zero autocorrelation) holds, of observing a relationship at least
as strong as the one in the sample.
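A minimal sketch of such a regression test follows, using only the Python
standard library. The function name, the synthetic revisions, and the use of a
large-sample normal approximation in place of the exact Student's t
distribution are all assumptions made for illustration, not part of the method
described above.

```python
import math
import random
from statistics import NormalDist


def revision_regression(rev_x, rev_y):
    """OLS of rev_y on rev_x; returns (slope, t_stat, approx_p_value)."""
    n = len(rev_x)
    mean_x = sum(rev_x) / n
    mean_y = sum(rev_y) / n
    sxx = sum((x - mean_x) ** 2 for x in rev_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rev_x, rev_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(rev_x, rev_y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / std_err
    # Assumption: N is large, so the t distribution with N - 2 degrees of
    # freedom is approximated by the standard normal distribution.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return slope, t_stat, p_value


# Synthetic revisions (hypothetical data, not taken from any real system).
rng = random.Random(0)
rev_prev = [rng.gauss(0.0, 0.05) for _ in range(500)]
# Revisions unrelated to the previous ones, as an efficient system would give.
rev_indep = [rng.gauss(0.0, 0.05) for _ in range(500)]
# Revisions with momentum: half of the previous revision carries over.
rev_momentum = [0.5 * r + rng.gauss(0.0, 0.05) for r in rev_prev]

slope_indep, t_indep, p_indep = revision_regression(rev_prev, rev_indep)
slope_mom, t_mom, p_mom = revision_regression(rev_prev, rev_momentum)
print(f"independent: slope={slope_indep:.3f}, p={p_indep:.3f}")
print(f"momentum:    slope={slope_mom:.3f}, p={p_mom:.3g}")
```

For small samples the exact t distribution should replace the normal
approximation. Note that the test uses only the forecasts, not the outcomes.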

In the remainder of this section, we propose statistics that may be used as
the basis of tests of efficiency. Common approaches for evaluating single-period
probability forecasts classify forecasts into discrete bins. Following this
convention, we classify $p_{t,n}$ into discrete bins $j = 1, \ldots, J$, defined
identically for all $t$ (although it is not necessary for them to be
identical). We define $b(p)$ as the index of the bin that probability $p$ falls
into, $b(p) \in \{1, \ldots, J\}$, and

$$q_t(j) = \frac{\sum_{n : b(p_{t,n}) = j} x_n}{\sum_{n : b(p_{t,n}) = j} 1}$$

equivalently the relative frequency of the event $X = 1$ conditional on the
forecast $p_t$ falling in bin $j$. $q_t(j)$ is also equal to the within-sample
reliability-calibrated forecast for bin $j$.
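The binning step can be sketched as follows. Equal-width bins on [0, 1], the
function names, and the eight-event sample are assumptions chosen for
illustration; any other bin definition could be substituted.

```python
def bin_index(p, n_bins):
    """Return the bin index b(p) in {1, ..., n_bins}."""
    # Assumption: equal-width bins; p = 1.0 is placed in the top bin.
    return min(int(p * n_bins), n_bins - 1) + 1


def calibrated_forecasts(forecasts, outcomes, n_bins):
    """Return {j: q(j)}: relative frequency of x = 1 in each occupied bin."""
    hits = {}
    counts = {}
    for p, x in zip(forecasts, outcomes):
        j = bin_index(p, n_bins)
        hits[j] = hits.get(j, 0) + x
        counts[j] = counts.get(j, 0) + 1
    return {j: hits[j] / counts[j] for j in counts}


# Hypothetical period-t forecasts and outcomes for N = 8 events.
p_t = [0.05, 0.15, 0.12, 0.55, 0.58, 0.91, 0.95, 0.99]
x = [0, 0, 1, 1, 0, 1, 1, 1]
q_t = calibrated_forecasts(p_t, x, n_bins=5)
print(q_t)
```

Bins with no forecasts are simply absent from the result, since the relative
frequency is undefined there.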

Each-period reliability (3) may be tested for each t using the same methods
used for testing single-period probabilistic forecasts. Statistical tests for
the null hypothesis of forecast reliability, which can be applied for a given t
to test for each-period reliability, have been suggested. Bröcker (2012)
suggests a family of confidence intervals on q(j) over J bins as a test for
single-period reliability. A family test for each-period reliability over many
periods, based on the null of each-period reliability, could be developed.

^We refer to Student's t-test by name to avoid confusion with period t.

The statistic in (11) is a single-period measure of inter-period unreliability.
Under the null hypothesis of efficiency, its expected value is zero for each t,
$1 < t \le T$.

inter-period reliability statistic:

$$\frac{1}{N} \sum_{n=1}^{N} \left( p_{t,n} - q_{t-1}\big(b(p_{t-1,n})\big) \right) \qquad (11)$$

where $q_{t-1}(b(p_{t-1,n}))$ is the reliability-calibrated $t - 1$ forecast.
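One reading of the inter-period reliability statistic, taken here as the
sample mean of the difference between the period-t forecast and the
reliability-calibrated period-(t - 1) forecast, can be sketched as follows. The
averaging over the sample, the equal-width bins, the function names, and the
data are assumptions made for illustration.

```python
def bin_index(p, n_bins):
    """Bin index in {1, ..., n_bins}; assumes equal-width bins on [0, 1]."""
    return min(int(p * n_bins), n_bins - 1) + 1


def inter_period_statistic(p_t, p_tm1, outcomes, n_bins):
    """Mean over n of p_{t,n} - q_{t-1}(b(p_{t-1,n}))."""
    # Reliability-calibrated forecast for each occupied period-(t-1) bin.
    hits, counts = {}, {}
    for p, x in zip(p_tm1, outcomes):
        j = bin_index(p, n_bins)
        hits[j] = hits.get(j, 0) + x
        counts[j] = counts.get(j, 0) + 1
    q_tm1 = {j: hits[j] / counts[j] for j in counts}
    diffs = [a - q_tm1[bin_index(b, n_bins)] for a, b in zip(p_t, p_tm1)]
    return sum(diffs) / len(diffs)


# Hypothetical sample of N = 4 events.
x = [0, 0, 1, 1]
p_tm1 = [0.1, 0.1, 0.9, 0.9]
p_t_balanced = [0.2, 0.0, 0.8, 1.0]  # differences cancel on average
p_t_biased = [0.3, 0.1, 0.9, 1.0]    # systematically above calibrated values

stat_balanced = inter_period_statistic(p_t_balanced, p_tm1, x, n_bins=5)
stat_biased = inter_period_statistic(p_t_biased, p_tm1, x, n_bins=5)
print(stat_balanced, stat_biased)
```

A value far from zero, relative to its sampling variability, would count as
evidence against the null hypothesis of efficiency.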

Testing  whether  a  forecast  is  improving,  even  for  a  single  pair  of  periods  t 
and  t  —  1,  is  not  entirely  straightforward.  However,  as  discussed  in  Section  3.4, 
if  a  forecasting  system  is  reliable,  improvement  must  come  from  discrimination. 
Moreover,  given  enough  data,  each-period  reliability  may  be  readily  enforced 
by  post-processing.  Therefore,  we  might  prefer  a  statistic  that  separates  the 
measure  of  discrimination  from  the  measure  of  reliability.  Conveniently,  per 
Proposition  4,  an  efficient  forecast  is  improving  with  respect  to  any  strictly 
proper  scoring  function.  This  suggests  that  a  test  for  the  difference  in  the 
discrimination  component  of  any  scoring  function  would  be  a  useful  test  for  the 
strictly  improving  property  (sufficiency  of  later  forecasts). 

The Brier score may be decomposed into reliability, discrimination, and
variability components, as in Wilks (2011) (p. 333). The sample estimate of the
discrimination component is shown in (12). Large values of (12) lead to better
Brier scores. Because (12) is computed from $q_t(j)$, it estimates the variance
of the reliability-calibrated forecast. For an efficient forecasting sequence,
therefore, (12) should be decreasing in t. A null hypothesis that (12) is not
improving (the system is inefficient) would assume that (12) evaluated at
$\tau$ and at $t < \tau$ are equal.

discrimination statistic:

$$\frac{1}{N} \sum_{j=1}^{J} m_t(j) \left( q_t(j) - \bar{x} \right)^2 \qquad (12)$$

where $m_t(j) = \sum_{n : b(p_{t,n}) = j} 1$ and
$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$.

At the limit as J increases, (12) is equal to the sample variance of the
reliability-calibrated forecast. For a reliable forecast, the sequence is
improving in discrimination if this variance (equivalently, the standard
deviation) is increasing, and statistical tests for differences in variance may
be applied for each t.
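The discrimination statistic can be computed from a sample as in the sketch
below, which takes it to be the count-weighted squared deviation of the bin
relative frequencies from the overall base rate (the discrimination term of the
Brier decomposition). Equal-width bins, the function names, and the data are
assumptions made for illustration.

```python
def bin_index(p, n_bins):
    """Bin index in {1, ..., n_bins}; assumes equal-width bins on [0, 1]."""
    return min(int(p * n_bins), n_bins - 1) + 1


def discrimination(forecasts, outcomes, n_bins):
    """(1/N) * sum_j m(j) * (q(j) - x_bar)^2 over occupied bins."""
    n = len(outcomes)
    x_bar = sum(outcomes) / n
    hits, counts = {}, {}
    for p, x in zip(forecasts, outcomes):
        j = bin_index(p, n_bins)
        hits[j] = hits.get(j, 0) + x
        counts[j] = counts.get(j, 0) + 1
    return sum(m * (hits[j] / m - x_bar) ** 2 for j, m in counts.items()) / n


# Hypothetical outcomes and two forecasts of the same N = 8 events.
x = [0, 0, 0, 0, 1, 1, 1, 1]
p_uninformative = [0.5] * 8
p_sharper = [0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1]

d_uninformative = discrimination(p_uninformative, x, n_bins=5)
d_sharper = discrimination(p_sharper, x, n_bins=5)
print(d_uninformative, d_sharper)
```

The forecast that always issues the base rate has zero discrimination, while
the forecast that separates events into bins with different relative
frequencies scores higher.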

The statistics in (11) and (12) can be used as the basis for designing
hypothesis tests for each t. For multiple single-period tests, an appropriate
test of the family of inferences could be designed. Novel statistical tests for
efficiency of sequence-forecasting systems may be a productive area for future
research.

The diagnosis of inefficiency is essentially a search for unexploited structure
in the forecasts. Any predictability that is not captured in the forecast can
be used in post-processing, or can point to a way to improve the forecasting
system. Patterns that might not show up in an average over j, such as
conditional bias or trends that are a function of $p_t$, may also be of
interest.


4.3  Remediation 

Once diagnosed, inefficiency can be remedied in post-processing.
Post-processing, such as debiasing to correct for over-precision, is already a
common intervention for subjective single-period probability forecasts.
Similarly, in ensemble-based meteorological forecasting, a correction for
under-dispersion (insufficient spread) in the ensemble is common. In the
context of probability sequences, empirical or parametric functions of the form
$p_t^*(p_t) = P(X = 1 \mid p_t)$, estimated from the available sample from the
forecasting system, may be used to generate better forecasts, such that
$E[s(p_t^*(p_t), X)] > E[s(p_t, X)]$.

post-processing function: $p_t^*(p_t) = P(X = 1 \mid p_t)$    (13)

For example, System 2 in Table 4b can be improved in period t = 1 by
post-processing using the information from $P_1$ and $P_2$. Table 5 shows the
conditional probabilities $P(X = 1 \mid p_2, p_1)$ for System 2. A modified
System 2 using the probabilities in Table 5 in place of $p_1$, i.e.
$p_1^* = P(X = 1 \mid p_2, p_1)$, has a period-1 negative Brier score of
-0.149, which is better than System 2's period-1 negative Brier score of
-0.160.

             p1 = 0.8    p1 = 0.2
p2 = 0.6     1.000       0.333
p2 = 0.4     0.723       0.138

Table 5: Conditional probabilities for System 2 using both periods' forecasts.
The values shown are $P(X = 1 \mid p_2, p_1)$.
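The two Brier scores quoted above can be checked numerically from Table 5, as
sketched below. The joint probabilities of the forecast pairs are not given in
this section, so the `cell_prob` values are assumed: they were back-solved to
be consistent with Table 5, with each-period reliability, and with
$E[P_1 \mid P_2 = 0.4] = 0.469$. The negative Brier score is taken to be
$-(p - x)^2$, and the function name is hypothetical.

```python
# Table 5: P(X = 1 | p2, p1) for System 2, keyed by (p2, p1).
cond_prob = {
    (0.6, 0.8): 1.000,
    (0.6, 0.2): 0.333,
    (0.4, 0.8): 0.723,
    (0.4, 0.2): 0.138,
}

# ASSUMPTION: joint probabilities P(p2, p1), back-solved for illustration;
# they are not taken from the text and should be replaced by the true values.
cell_prob = {
    (0.6, 0.8): 0.12,
    (0.6, 0.2): 0.18,
    (0.4, 0.8): 47 / 150,
    (0.4, 0.2): 58 / 150,
}


def expected_neg_brier(forecast_for_cell):
    """Expected value of -(f - X)^2 when forecast f is issued in each cell."""
    total = 0.0
    for cell, weight in cell_prob.items():
        f = forecast_for_cell(cell)
        q = cond_prob[cell]  # probability that X = 1 in this cell
        total += weight * (q * (f - 1.0) ** 2 + (1.0 - q) * f ** 2)
    return -total


original = expected_neg_brier(lambda cell: cell[1])           # issue p1 as is
post_processed = expected_neg_brier(lambda cell: cond_prob[cell])
print(round(original, 3), round(post_processed, 3))
```

Under these assumed cell probabilities the sketch gives approximately -0.160
for the original period-1 forecast and approximately -0.149 for the
post-processed forecast, in line with the values quoted above.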

Perhaps more interesting, a diagnosis of inefficiency implies that the
forecasting system does not fully incorporate the available information and is
susceptible to improvement. It may therefore benefit from a re-examination of
the forecasting process, motivated and informed by the results of efficiency
tests.

For  subjective  forecasts,  diagnostic  test  results  could  be  provided  to  the 
forecasters  as  feedback  to  help  them  self-calibrate  and  motivate  a  search  for 
more  valid  predictors  or  better  synthesis  of  information. 

For dynamical model-based forecasts, diagnostic tests showing inefficiency
suggest a search for improvements to the underlying dynamical model or the
addition of statistical post-processing to exploit any detectable structure in
the system. For statistical models, the implications are the same. For example,
inefficiency may suggest the possibility of a prior distribution that the model
does not capture, or predictors used in early forecasts that provide
independent information that could be exploited in later forecasts.

4.4  Implications  for  scoring  functions 

Scoring  functions  that  are  separable  functions  of  single-period  scores  do  not 
diagnose  violations  of  efficiency.  For  example,  a  separable  scoring  function  would 
not  detect  inter-period  structure  such  as  regression  to  the  base  rate  or  other 
trends  that  could  be  removed  in  post-processing.  In  fact,  the  two  systems  shown 
in Table 4 have identical marginal distributions $P(P_t = p_t)$ for all $t$ and
$p_t$, and therefore would have identical single-period scores regardless of
the scoring function. Tests that depend on the correlation structure are
necessary to diagnose inefficiency.

This  suggests  that  sequence-scoring  functions  of  a  form  that  depend  on  the 
inter-period  structure  of  the  system,  which  excludes  functions  of  the  form  (14), 
may  be  desirable.  The  question  of  which  functions  are  appropriate  in  a  given 
situation  is  open. 


$$\sum_{t=1}^{T} \sum_{n=1}^{N} s(p_{t,n}, x_n) \qquad (14)$$
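The blindness of a separable score to inter-period structure can be
illustrated with a small sketch. The two samples below are hypothetical: they
contain exactly the same (forecast, outcome) pairs in each period, but pair the
two periods' forecasts differently. The negative Brier score $-(p - x)^2$
stands in for the single-period score, and the function names are assumptions.

```python
def neg_brier(p, x):
    """Single-period negative Brier score."""
    return -(p - x) ** 2


def separable_score(sample):
    """Sum of single-period scores over all periods and observations."""
    return sum(neg_brier(p, x) for forecasts, x in sample for p in forecasts)


def covariance(sample):
    """Sample covariance between the two periods' forecasts."""
    n = len(sample)
    m1 = sum(f[0] for f, _ in sample) / n
    m2 = sum(f[1] for f, _ in sample) / n
    return sum((f[0] - m1) * (f[1] - m2) for f, _ in sample) / n


# Each observation is ((p_1, p_2), x); the data are hypothetical.
sample_a = [((0.8, 0.6), 1), ((0.2, 0.4), 1), ((0.8, 0.6), 0), ((0.2, 0.4), 0)]
# Same per-period (forecast, outcome) pairs, but the p_2 values are swapped
# among observations that share the same outcome.
sample_b = [((0.8, 0.4), 1), ((0.2, 0.6), 1), ((0.8, 0.4), 0), ((0.2, 0.6), 0)]

score_a = separable_score(sample_a)
score_b = separable_score(sample_b)
cov_a = covariance(sample_a)
cov_b = covariance(sample_b)
print(score_a, score_b, cov_a, cov_b)
```

The separable score is the same for both samples, while the inter-period
covariance has opposite signs, so only a statistic that looks across periods
can tell the two apart.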

After more than 60 years of research on probabilistic forecasting, there is no
consensus on the best single-period scoring function, and ongoing research
explores the relative merits of various scoring functions in particular
contexts. For example, an interesting difference arises from the consideration
of probability forecasts used to make investment decisions in competitive
markets, where the purchase price of an asset depends on the market (or a
competitor's) probability forecast, and small differences in extreme (small or
large) probabilities are very important. Hence Jose, Nau, and Winkler (2008) do
not adopt Selten's (1998) criterion that a scoring rule should not be
hypersensitive (defined p. 50) to small differences. The meteorology community
commonly uses skill scores that are not proper in evaluating its forecasts
(Gneiting and Raftery 2007). Skill scores adjust for a measure of difficulty in
making the forecast, often a reference or baseline forecasting system. In that
context, the benefits of this adjustment outweigh any incentive for untruthful
forecasting (or biased models). We anticipate that a similar breadth of
multi-period scoring functions would be useful and appropriate for different
forecasting and decision contexts.

The decision context in which the forecasting system might be used could, for
example, determine the relative importance of its performance at each t,
informing the selection of a scoring function. Moreover, it is very possible
that a decision maker could benefit more from using a particular inefficient
forecasting system than from an alternative efficient forecasting system.
However, the inefficient forecasting system is clearly improvable. Therefore,
although it may be superior in user value to other, efficient forecasting
systems, it should not satisfy a user (or, for that matter, a forecaster),
because its inefficiency demonstrates that it is possible to create a system
that will perform even better.

5  Conclusion 

Probability sequences are becoming more common and relevant to decision makers
in many domains, including finance and intelligence. To our knowledge, this is
the first formal examination of the inter-period behavior of probability
sequences. We showed several properties of an efficient sequence forecasting
system, i.e. that it is reliable for each period, is inter-period reliable and
therefore memoryless, always has zero expected revision, and is strictly
improving with respect to any strictly proper scoring function. Some of these
properties are intuitive, but others, in particular inter-period reliability
and zero autocorrelation, are not. Analogous conditions of efficiency are
commonly violated in forecasting of non-probability variables, and therefore we
should expect to find probability sequences that violate these properties and
are susceptible to improvement.
1  Appendix

1.1  Convexity  of  expected  score 

For differentiable $s(\cdot)$, properness (1) implies that
$\frac{d}{dp} E_r[s(p, X)] = 0$ when evaluated at $p = r$, and therefore, for
proper $s(\cdot)$,

$$\frac{s_1'(p)}{s_0'(p)} = -\frac{1 - p}{p}, \qquad s_1'(p) = -\frac{1 - p}{p}\, s_0'(p) \qquad (15)$$

For reliable $P$,

$$E[s(p, X) \mid p] = E_p[s(p, X)] = p\, s_1(p) + (1 - p)\, s_0(p)$$

$$\frac{d}{dp} E[s(p, X) \mid p] = p\, s_1'(p) + s_1(p) + (1 - p)\, s_0'(p) - s_0(p)$$

$$= p \left( -\frac{1 - p}{p}\, s_0'(p) \right) + s_1(p) + (1 - p)\, s_0'(p) - s_0(p)$$

$$= s_1(p) - s_0(p)$$

For positively-oriented $s(\cdot)$, recall that $s_1'(p) > 0$ and
$s_0'(p) < 0$, so

$$\frac{d^2}{dp^2} E[s(p, X) \mid p] = s_1'(p) - s_0'(p) > 0$$

and therefore, for positively-oriented, proper $s(\cdot)$ and reliable $P$,
$E[s(p, X) \mid p]$ is convex in $p$.
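As a check on the derivation, the steps can be worked through for one concrete
scoring function. The sketch below assumes the negative Brier score,
$s(p, x) = -(p - x)^2$, which is positively oriented and proper; the strict
inequalities hold for $0 < p < 1$.

```latex
\begin{align*}
s_1(p) &= -(1 - p)^2, & s_0(p) &= -p^2, \\
s_1'(p) &= 2(1 - p) > 0, & s_0'(p) &= -2p < 0, \\
\frac{s_1'(p)}{s_0'(p)} &= -\frac{1 - p}{p}, \\
E[s(p, X) \mid p] &= -p(1 - p)^2 - (1 - p)p^2 = -p(1 - p), \\
\frac{d}{dp} E[s(p, X) \mid p] &= 2p - 1 = s_1(p) - s_0(p), \\
\frac{d^2}{dp^2} E[s(p, X) \mid p] &= 2 = s_1'(p) - s_0'(p) > 0.
\end{align*}
```

The third line agrees with (15), and the expected score $-p(1 - p)$ is convex
in $p$. At $p = 0.8$ or $p = 0.2$ it equals $-0.16$, the period-1 negative
Brier score quoted for System 2 in Section 4.3.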